Ensemble stacking rockburst prediction model based on Yeo–Johnson, K-means SMOTE, and optimal rockburst feature dimension determination

Rockburst forecasting plays a crucial role in prevention and control of rockburst disaster. To improve the accuracy of rockburst prediction at the data structure and algorithm levels, the Yeo–Johnson transform, K-means SMOTE oversampling, and optimal rockburst feature dimension determination are used to optimize the data structure. At the algorithm optimization level, ensemble stacking rockburst prediction is performed based on the data structure optimization. First, to solve the problem of many outliers and data imbalance in the distribution of rockburst data, the Yeo–Johnson transform and k-means SMOTE algorithm are respectively used to solve the problems. Then, based on six original rockburst features, 21 new features are generated using the PolynomialFeatures function in Sklearn. Principal component analysis (PCA) dimensionality reduction is applied to eliminate the correlations between the 27 features. Thirteen types of machine learning algorithms are used to predict datasets that retain different numbers of features after dimensionality reduction to determine the optimal rockburst feature dimension. Finally, the 14-feature rockburst dataset is used as the input for integrated stacking. The results show that the ensemble stacking model based on Yeo–Johnson, K-means SMOTE, and optimal rockburst feature dimension determination can improve the accuracy of rockburst prediction by 0.1602–0.3636. Compared with the 13 single machine learning models without data preprocessing, this data structure optimization and algorithm optimization method effectively improves the accuracy of rockburst prediction.

Rockbursts have always been a difficult and important topic in rock mechanics research. Rockbursts are frequent geological disasters in the construction of water conservancy, hydropower, and transportation projects; deep mining; geological treatment of nuclear waste; and deep physical underground laboratories 1,2 . They not only affect the construction process but also threaten lives and property 3,4 . If the occurrence of a rockburst can be accurately predicted and protective measures taken in advance to reduce the occurrence of rockbursts, staff casualties and property losses can be greatly reduced. Therefore, rockburst prediction has attracted much attention in the past two decades 5 .
There are four categories of rockburst prediction methods: empirical methods 6-8 , simulation techniques 9-11 , mathematical algorithms [12][13][14] , and monitoring technologies [15][16][17] . Both empirical methods and simulation techniques use similar simulation tests, and there is a certain gap between the rockburst phenomenon in the laboratory and the actual engineering. Monitoring technologies it is difficult to determine the threshold of rockburst, and the monitoring equipment is easily damaged during underground mining. With the advent of big data and artificial intelligence, rockburst prediction research methods based on mathematical algorithms have become increasingly prominent. Initially, a single model was used for rockburst prediction. For example, Feng et al. 18 built a support vector machine model, Zhou et al. 19  www.nature.com/scientificreports/ eliminate the correlations between rockburst features, and ensure the features are independent. Finally, the exhaustive method is used to identify the number of features that provides the highest average accuracy under the 13 machine learning algorithms. Rockburst prediction based on stacking is presented in section "Rockburst prediction based on ensemble stacking". A rockburst dataset comprising 14 features is input for stacking. Then, an appropriate model is selected from the 13 machine learning algorithms as the first layer base model for stacking. The logistic regression classifier is used as the second output model for stacking, which obtains the rockburst prediction results. The XgBoost model, which has the highest learning ability for the rockburst dataset comprising 14 features, is compared with the ensemble stacking model. The results demonstrate the advantages of the ensemble stacking model.

Rockburst data acquisition and analysis
Although there are numerous records of rockburst cases around the world, the impact factors of rockbursts in related cases are very limited. A total of 275 rockburst case samples with no duplicates or missing values were collected from the literature. The overall sample includes 51 groups of no rockburst occurrence (I) samples, 74 groups of weak rockburst (II) samples, 117 groups of moderate rockburst (III) samples, and 33 groups of strong rockburst (IV) samples. Based on previous research in machine learning and a comprehensive evaluation of rockburst influencing factors used in rockburst prediction, the following were selected as the rockburst prediction features: the maximum tangential stress of the surrounding rock ( σ θ ), uniaxial compressive strength of the rock ( σ c ), uniaxial tensile strength of the rock ( σ t ), rock elastic strain energy index ( W et ), rock stress coefficient SR(σ θ /σ c )and rock brittleness coefficient BR(σ c /σ t ).
A comprehensive understanding of the rockburst dataset characteristics is a prerequisite for data structure optimization. Therefore, the statistical parameters of the features for each rockburst grade are listed in Table 1. The relevant rockburst classification standards [30][31][32] are summarized in Table 2, which combines domestic and international criteria and engineering cases of rockbursts as well as the classification standard for the Qinling Tunnel in China and the classification standard for rockbursts suggested by the Ministry of Railways. Figure 1 shows the proportion of each grade of rockburst, and Fig. 2 shows the overlaid histograms of each feature in the rockburst dataset.
In Fig. 1 Table 2. Rockburst classification standards.   Table 2, that as the values of features σ θ , σ c , σ t , W et , and SR increase, the rockburst grade increases accordingly. In contrast, the rockburst grade increases as the value of feature BR decreases. However, the average value of SR in grade II (0.51) is greater than that for grade III (0.48), while the average value of BR in grade II (22.94) is less than that in grade III (0.48). This is inconsistent with the increasing and decreasing trends of the rockburst classification standards. The reason for this phenomenon is that the maximum and minimum values of the features for each rockburst grade are quite different, the coefficients of variation of the rockburst features are large, and there are many outliers. As shown in Fig. 2, there are many outliers for each feature for different rockburst grades, and the sparse outliers are far away from the dense area of points. Thus, it is difficult to distinguish the rockburst grade based only on the value of a single feature. According to the rockburst classification standard in Table 2, single features were used to judge the rockburst grades of the dataset. The accuracy rates of σ θ , σ c , σ t , W et , SR, and BR were 0.48, 0.35, 0.35, 0.33, 0.39, and 0.44, respectively. Therefore, the single features have low accuracy for judging the grade of rockburst cases. The root cause of this phenomenon is the failure of rockburst grades to fully reflect the influence of rockburst control factors. Therefore, comprehensively judging the grades of rockburst cases based on multiple features can provide higher accuracy.

Yeo-Johnson transformation and balancing of the rockburst data
Yeo-Johnson transformation. The preprocessing of the rockburst dataset mainly solves the problems of a large number of outliers in the dataset and the imbalance in the number of samples of each rockburst grade. For outliers in the dataset. Tan et al. 29 proposed the LOF algorithm to detect and remove outliers in a dataset. Yin et al. 25 proposed the LOF algorithm to detect outliers in a dataset and replaced them using the expectation maximization (EM) algorithm. Properly removing or replacing outliers in the spatial distribution of rockburst samples can effectively improve the data structure and the rockburst prediction ability. However, outliers in the rockburst dataset are inherent attributes. The elimination and replacement of outliers may destroy the original characteristics of the data and ignore a small number of objective laws.
In view of this, the Yeo-Johnson 34 transform was proposed to process the rockburst features. This method is a power transformation, which is often used in the data preprocessing stage of data mining and machine learning. It can reduce the heteroscedasticity of rockburst features and amplify the normality, thus resulting in a probability density function that is closer to a normal distribution. Compared with directly removing or replacing outliers, the Yeo-Johnson transformation retains outliers in the original dataset, improves the data structure, and reduces the influence of outliers on the prediction results.
The Yeo-Johnson transformation is defined as follows: where y is the rockburst feature data, and is the parameter estimated by the maximum likelihood method. High-dimensional digital features are difficult to display intuitively in space. Therefore, to illustrate the effect of the transformation, feature σ θ with a smaller coefficient of variation and feature SR with a larger coefficient of variation are selected to construct the scatter plots in Figs. 3 and 4. Figure 3 shows the original data without scaling, whereas Fig. 4 shows the data after the Yeo-Johnson transformation. Surrounding the scatter plot is the  www.nature.com/scientificreports/ Rockburst data balancing based on K-means SMOTE. The frequencies of different rockburst grades are quite different, resulting in unbalanced rockburst data. There are two main approaches to address data imbalance: data level (oversampling, undersampling, and mixed sampling) and algorithm level 35 . At the data level, sampling methods are used to increase or decrease various rockburst samples to balance the dataset. The algorithm-level approach employs algorithms that are not sensitive to unbalanced datasets, such as Extra Tree, random forest (RF), and CatBoost. Insufficient attention has been paid to the records of rockburst cases in engineering practice. There are only 275 rockburst cases in our dataset, of which 33 rockburst cases are grade IV. The under-sampling method can easily cause the loss of useful information, which leads to a decrease in the model accuracy. Therefore, oversampling method is more suitable. Machine learning algorithms usually require as large a dataset as possible. Oversampling methods can be divided into random and informed methods to generate oversampled samples 36 . Randomly generated oversampling samples can easily destroy the data structure and result in model overfitting. Among the informed generation methods for oversampling, the SMOTE algorithm can avoid overfitting. However, it may introduce noise to the dataset 37 . The borderline-SMOTE 38 algorithm divides the data into three types: safety, danger, and noise. Only a few dangerous samples are oversampled, and thus no noise data will be generated. However, the algorithm has weaknesses in dealing between the within-class imbalance. Between-class means that the imbalance of the data sample numbers between the minority class and the majority class. Withinclass imbalance means that the imbalance of the distribution position or distribution density of the sample.
Therefore, the K-means SMOTE algorithm is proposed to oversample the rockburst dataset after the Yeo-Johnson transformation. The K-means SMOTE algorithm consists of three steps: clustering, filtering, and oversampling. The clustering step divides the rockburst data into k clusters using the K-means algorithm. The  www.nature.com/scientificreports/ filtering step retains clusters with a high proportion of minority samples, and then synthesizes more minority samples in sparse clusters. The oversampling step performs SMOTE oversampling on the clusters with a low density of minority samples. The sparser the minority samples in the cluster are, the more minority samples will be added. The algorithm identifies a sparse sample area by calculating the average distance of minority samples in a cluster, and generates more samples in the sparse sample area, which reduces the within-class imbalance 39 . The calculation steps for K-means clustering are as follows: (1) Suppose the input dataset is D = {x 1 , x 2 , . . . , x m } , and the division of clusters is C = {C 1 , C 2 , . . . , C k } . Randomly select k samples from dataset D as the initial k centroid vectors, 2 between all sample points, x i , and each centroid vector, µ k ; divide the sample points into the nearest cluster, x i ∈ C nearest ; and update the cluster, (4) Repeat calculation steps (2) and (3) until all of the centroid vectors, µ k , remain constant; output C.
The filtering step selects clusters with a high proportion of minority samples. The oversampling step is performed as follows: (1) For each filtered cluster, C i , calculate the Euclidean distance matrix, ignoring the majority samples.
(2) Compute the mean distance, d(C i ), within each cluster by summing all non-diagonal elements of the distance matrix, and then dividing by the number of non-diagonal elements. (3) Compute the density of each filtered cluster as density( (6) Perform SMOTE oversampling for each filtered cluster. New samples are generated by interpolation from the minority samples in the cluster: In the filtered clusters, based on the sparseness of the minority samples, generate r(C k ) × m new samples, where x is a newly generated sample, a is a randomly selected minority sample in the cluster, b is the nearest neighbor minority sample of a , and m is the total number of samples in dataset D.
To compare the sensitivities of different algorithms to imbalanced datasets and the generalization abilities of various algorithms, the prediction results of 13 machine learning algorithms are compared for both the original rockburst dataset and rockburst dataset preprocessed by the Yeo-Johnson transform and K-Means SMOTE oversampling. The 13 machine learning algorithms considered are the support vector classifier (SVC), decision tree (DT), K-nearest neighbor (KNN), naive Bayes classifier (NBM), Gaussian processes (GP), multi-layer perceptron (MLP), quadratic discriminant analysis (QDA), random forest (RF), gradient boosting (GB), extreme gradient boosting (XgBoost), light boosting (LightBoost), extra tree (ET), and CatBoost. The accuracy, precision, recall rate, and F1 values are obtained for the training set and test set prediction results for each rockburst grade. Table 3 lists the prediction results obtained with the original rockburst dataset. Table 4 lists the prediction results obtained with the rockburst dataset after preprocessing. Stratified sampling of the dataset is used to divide the training and test sets such that the proportion of rockburst samples of each grade is consistent in the training and test sets. Three-quarters of the dataset is used as the training set to train the model, and the remaining 1/4 is used as the test set to evaluate the reliability and generalization ability of the model. In the model training process, grid search with cross-validation is used to obtain the optimal parameters with the highest accuracy.
As can be seen from Tables 3 and 4 The results show that the rockburst dataset without data preprocessing has poor overall prediction results. In particular, the prediction results are lowest for the most hazardous grade IV rockbursts. After the Yeo-Johnson transformation and K-means SMOTE oversampling, the data structure is significantly improved, and a large number of outliers and data imbalance problems in the datasets for each rockburst grade are effectively addressed, thus improving the generalization ability of the model.

Rockburst data feature analysis and determination of the optimal rockburst feature dimensions
Rockburst data feature analysis. Breiman 40 noted that an improvement in accuracy requires a more complex prediction model. It is usually difficult to achieve the best prediction accuracy using simple and interpretable models. However, complex machine learning algorithms inevitably have black box properties. To pro- www.nature.com/scientificreports/ vide complex black box models with some interpretability, it is convenient to analyze the role of each feature in the prediction process. The ET and KNN models with the highest accuracy are used to evaluate the importance of features, and the importance of features is measured by the method of mean decrease accuracy. The method of reducing the average accuracy rate directly measures the impact of each feature on the accuracy of the model, by disrupting the order of the feature values of each feature, and measuring the impact of sequence changes on the accuracy of the model. For unimportant features, shuffling the order has little effect on the accuracy of the model. But for important features, disrupting the order will significantly reduce the accuracy of the model. Figure 5 shows the degree of mean decrease accuracy of the ET and KNN models, and the lines indicate the fluctuation range of the error.
In Correlation analysis of the features is performed to calculate the degree of correlation between two variables and analyze the degree of information redundancy contained in rockburst features 41 . Completely correlated variables represent truly redundant information, and adding completely correlated variables will not introduce additional information. Therefore, most scholars believe that redundant information contained in rockburst features will lead to poor model prediction results 42,43 . The PCA dimensionality reduction method is used to eliminate the correlations between rockburst features, which can eliminate features that contain less information. The Pearson correlation coefficient evaluates the linear relationship between two variables as follows: where x i is the i-th sample value of a certain rockburst feature, y i is the i-th sample value of another rockburst feature, and N is the total number of samples.
The feature Pearson correlation coefficients are calculated for the preprocessed rockburst dataset, and the results are shown in Fig. 6. In general, if the absolute value of the Pearson's correlation coefficient is within 0.8-1.0, the two variables are considered very strongly correlated; if the Pearson's correlation coefficient is within 0.6-0.8, 0.4-0.6, 0.2-0.4, and 0-0.2, the two variables are considered strongly correlated, moderately correlated, weakly correlated, and very weakly/not correlated, respectively. Figure 6 shows that there are no extremely strong correlations between rockburst features. There are strong correlations between σ θ and SR, σ c and σ t , σ c and W et , and σ t and BR. Therefore, the rockburst features have partial information redundancy. In general, there is not an excessive amount of redundancy in the rockburst dataset, and each feature carries some unique information.
Determination of the optimal rockburst feature dimension. In engineering practice, there are many factors affecting rockbursts, and there are more than ten corresponding indexes. However, in engineering practice, the record of rockburst cases has not received sufficient attention or it is difficult to obtain some features, which leads to a lack of some rockburst indicators in the available rockburst statistics, such as the point load strength of rock ( I s ), deformation before the peak strength of rock ( U ), and stiffness of the loading process on the stress-strain curve ( K m ). Relying only on six rockburst indicators ( σ θ , σ c , σ t , W et , SR, and BR) may have problems that cannot fully reflect the rockburst phenomenon. In general, before the curse of dimensionality, the more features are considered, the easier it is for the decision boundaries of the model to distinguish different categories, and the better the classification effect will be. If all features are predictive to a certain extent and the features are not completely correlated, then an appropriate increase in the number of features can improve the prediction ability 44,45 . Vong et al. 46 noted that when the classified features resemble a family structure, the dataset will have a certain immunity to the curse of dimensionality, and an appropriate increase in the number of fea- Figure 5. The mean decrease accuracy graph of ET and KNN models. www.nature.com/scientificreports/ tures is beneficial. However, the curse of dimensionality problem occurs when the data are high-dimensional 47 . It affects the learning process and reduces the accuracy.
To determine whether the rockburst dataset follows a family structure and the optimal number of classification features, first, on the basis of the six rockburst features, the PolynomialFeatures function in Sklearn is used to generate 21 new polynomial features. The method used to generate 21 new features is {a, b, a 2 , ab, b 2 } , where a is any feature of the original rockburst features, b is another arbitrary feature of the original rockburst features, and a 2 , ab, and b 2 are newly generated features. Second, new features are generated from the original rockburst features, and these features inevitably have a strong correlation and excessive redundant information. Hence, PCA dimensionality reduction is used to process these 27 features, and the 27 principal components after PCA processing are retained. Finally, according to the amount of information contained in the principal components, principal components with less information are sequentially eliminated, and 26 rockburst datasets are constructed. These rockburst datasets contain two principal components, three principal components, etc., up to 27 principal components. Thirteen machine learning algorithms, SVC, DT, KNN, NBM, GP, MLP, QDA, RF, GB, XgBoost, LightBoost, ET, and CatBoost, are used to classify and predict the 26 rockburst datasets, resulting in a total of 26 × 13 = 338 classification prediction models.
PCA has three main functions. (1) When the number of samples is fixed and the features of the samples increase, the spatial distribution of samples becomes increasingly sparse, which leads to model overfitting. The PCA algorithm increases the sample density by discarding part of the information and alleviates the curse of dimensionality. (2) When the rockburst dataset is affected by noise, features with less information are often related to the noise, and eliminating features with little information can reduce noise. (3) In the rockburst dataset after PCA dimensionality reduction, each rockburst feature is independent of the others.
The PCA calculation steps are as follows: (1) Assume that the input dataset is D = {x Because the classification performance of the 13 machine learning models is inconsistent, the model prediction ability will be the best for different numbers of principal components. Moreover, when using the stacking algorithm to integrate multiple models, the number of principal components must be consistent for all models. The quality of the model and the generalization ability are reflected in the test set. Therefore, the average prediction accuracy of 13 models in 26 datasets for the statistical test set is used as the basis for determining the optimal number of classification features. The results are shown in Fig. 7. In this figure, the numbers 2 to 27 on the abscissa represent datasets containing 2 to 27 principal components, and the ordinate represents the average prediction accuracy of the 13 models for the test set.
As shown in Fig. 7, the dataset with 14 retained principal components had the highest average prediction accuracy of 0.7790. The prediction accuracy rates of SVC, DT, KNN, NBM, GP, MLP, QDA, GB, XgBoost,   (Table 4). To illustrate the relationship between the average prediction accuracy and the number of retained principal components, Fig. 7 shows two auxiliary lines with dashed arrows. In this figure, the accuracy with less than 14 retained principal components exhibits a fluctuating and gradually increasing trend, and the accuracy with greater than 14 retained principal components exhibits a fluctuating gradual decline. An appropriate increase in the number of independent principal components can improve the accuracy of rockburst prediction, and the rockburst dataset has certain characteristics of mitigating dimensional problems. This shows that the rockburst dataset conforms to the family structure described by Vong et al. 46 , and an appropriate increase in the number of rockburst features can improve the rockburst prediction.

Rockburst prediction based on ensemble stacking
After the original rockburst dataset has undergone the Yeo-Johnson transformation, K-means SMOTE oversampling, and rockburst feature combinations to derive new features, the learning capabilities of the 13 machine learning models are improved to varying degrees. To further improve the accuracy of the rockburst prediction, stacking technology in ensemble learning is used to combine multiple machine learning methods to improve the model learning performance 48 . The ensemble stacking is divided into two layers. The first layer is fitted with multiple base models to output new features. The second layer uses the output of the first layer as the input. This stacking method for combining multiple learners is a type of meta-learning, which means learning to learn. The stacking calculation process is divided into three steps, and the flowchart is shown in Fig. 8.
Step 1: First, select n machine learning models as the base model. The dataset with 14 rockburst features is divided into a training set (75%) and a test set (25%), and the training set is divided into five parts that are not crossed. Second, one of Train 1, Train 2, Train 3, Train 4, and Train 5 in the training set is used as the validation set, and the remaining four datasets are used as the training set. Third, the base model performs five-fold crossvalidation training on 75% of the training set and makes predictions based on the test set. Therefore, each set of Train data in the training set has a corresponding Predict value. Finally, each set of Train data in the training set is stacked, and new features generated by the base model are obtained from the training set; n base models generate n new features.
Step 2: Stack the n new features generated in Step 1 vertically for the training set and test set to obtain a new rockburst dataset.
Step 3: To prevent model overfitting, a logistic regression learner is used to train and predict the rockburst dataset with new features.
The XgBoost model has the highest accuracy in section "Determination of the optimal rockburst feature dimension". Therefore, to demonstrate the advantages of rockburst prediction based on stacking, the confusion matrixes of the XgBoost model and stacking model for the test set are shown in Fig. 9. The abscissa in this figure represents the predicted result for each rockburst grade, and the ordinate represents the true result for each rockburst grade. The diagonal position of the XgBoost model confusion matrix shows that the correct prediction numbers for no rockburst, weak rockburst, moderate rockburst, and strong rockburst events are 29, 23, 22, and 24, respectively. The diagonal position of the stacking model confusion matrix shows that the correct prediction numbers for no rockburst, weak rockburst, moderate rockburst, and strong rockburst events are 30, 23, 22, and 26, respectively. The results show that the stacking model has a stronger generalization ability for no rockbursts and strong rockbursts, and its accuracy is higher than that of the highest accuracy XgBoost model.  (2) To address the phenomena of outliers and data imbalance in the rockburst dataset, the Yeo-Johnson transformation is proposed to normalize the data distribution and reduce the interval between outliers and the cluster area, thereby reducing the impact of outliers on the forecast results. The K-means SMOTE algorithm is used to oversample the rockburst data set after the Yeo-Johnson transformation to ensure the rockburst samples attain both within-class balance and between-class balance. After data processing through the Yeo-Johnson transform and K-means SMOTE oversampling, the prediction accuracy of 13 single machine learning algorithm models is increased by an average of 0.1638. (3) Rockburst data has a family resemblance structure. Therefore, an appropriate increase in the number of features can improve or maintain the prediction ability. A method of multiplying two-by-two based on six original features and squaring a single original feature is adopted to generate 21 new features and construct a dataset with 27 rockburst features. Then, PCA technology is used to eliminate the correlations between features, ensuring each feature is independent of the others. The exhaustive method selects the number of features that produces the highest average accuracy of the 13 machine learning algorithms, and the average accuracy of the rockburst dataset with 14 features is 0.7790. (4) After the Yeo-Johnson transformation, K-means SMOTE oversampling, and determination of the optimal rockburst feature dimension of the original rockburst dataset, the rockburst data structure is significantly improved. To further improve the accuracy of rockburst prediction, the prediction ability is improved at the algorithm level. Fourteen rockburst features are used as the input for stacking; multiple machine learning algorithms are used as the first-level base model, and a logistic regression classifier is used as the secondlevel output model. Compared with 13 single machine learning models optimized for the data structure, the ensemble stacking model has an average prediction accuracy improvement of 0.0769.

Data availability
All data that support the findings of this study are available from the corresponding author upon reasonable request. www.nature.com/scientificreports/